Report: Exploratory Data Analysis on "Most streamed tracks in Spotify 2023" dataset.

Motivation: Understand the factors that drive the popularity, engagement, and overall success of top Spotify tracks in 2023.

Data collection

Data cleaning & pre-processing

Comments

  1. A few tracks (e.g., "Flowers", "Daylight") share a title but are by different artists.
  2. Only two features contain null values: in_shazam_charts and key.
  3. To understand what each variable represents, check out: https://www.kaggle.com/datasets/nelgiriyewithana/top-spotify-songs-2023.
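The two checks behind these comments can be sketched as follows. This is a minimal illustration on a made-up toy table, not the notebook's actual code; the column names "track_name" and "artist(s)_name" are assumed to match the Kaggle file, and the artist names are placeholders.

```python
import pandas as pd

# Toy stand-in for the real dataset (all values are made up for illustration).
df = pd.DataFrame({
    "track_name": ["Flowers", "Flowers", "Daylight", "Daylight", "Other"],
    "artist(s)_name": ["Artist A", "Artist B", "Artist C", "Artist D", "Artist E"],
    "in_shazam_charts": [10.0, None, 3.0, 0.0, 5.0],
    "key": ["C", "D", None, "G", "E"],
})

# Titles that appear under more than one distinct artist.
artists_per_title = df.groupby("track_name")["artist(s)_name"].nunique()
shared_titles = sorted(artists_per_title[artists_per_title > 1].index)

# Columns that contain at least one null value, with their null counts.
null_counts = df.isna().sum()
columns_with_nulls = null_counts[null_counts > 0].to_dict()

print(shared_titles)
print(columns_with_nulls)
```

On the real data, the same two expressions would list every shared title and every column with missing values.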

Comments

  1. In this project, we dropped all entries with null values. This approach is acceptable here given the size of the dataset and the educational nature of the project, but it is generally not ideal. In practice, one should consider the following steps: (1) impute missing values when appropriate, (2) remove features with many missing values if they are not critical, and (3) assess the impact of missing data before deciding on a cleaning strategy. Filling missing values is a crucial step in data analysis, so we will address it in upcoming projects.

  2. We have removed one corrupted entry that contained metadata in the "streams" variable, instead of a valid numerical value.

  3. It wasn't necessary in this case, but sometimes you may need to rename features to make them easier to interpret later on.
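The cleaning steps described above could look roughly like this. It is a sketch on made-up toy data, assuming the corrupted "streams" entry is simply a string that cannot be parsed as a number; the original notebook may have handled it differently.

```python
import pandas as pd

# Toy data; the non-numeric "streams" value mimics the corrupted entry.
raw = pd.DataFrame({
    "track_name": ["A", "B", "C", "D"],
    "streams": ["1000", "2500", "not-a-number", "400"],
    "key": ["C", None, "D", "E"],
})

# Coerce "streams" to numbers; values that cannot be parsed become NaN.
raw["streams"] = pd.to_numeric(raw["streams"], errors="coerce")

# Drop every row that has a null in any column (this also removes the
# corrupted row, whose "streams" value is now NaN).
clean = raw.dropna().reset_index(drop=True)
clean["streams"] = clean["streams"].astype("int64")

print(clean)
```

Coercing first and dropping afterwards handles the corrupted entry and the genuine nulls in a single pass, at the cost of not distinguishing between the two.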

Univariate analysis

Section objective: Get an idea of the statistical properties of the features (summary statistics, distributions, etc.)

Comments

After analyzing the statistical distribution of the features in the dataset, we draw the following insights about the most streamed songs on Spotify in 2023:

  1. Number of Artists: Most tracks feature a single artist, although collaborations between two or even three artists are relatively common.

  2. Release Year: The majority of tracks were released recently, reflecting listeners’ preference for contemporary music that aligns more closely with current tastes and trends.

  3. Release Month: Songs tend to be released at the beginning of the year, just before summer, or toward the end of the year. Interestingly, there’s a noticeable dip in releases during the summer months—possibly due to users disconnecting during holidays, reducing engagement with new releases.

  4. Release Day: There's a subtle trend toward releasing songs at the beginning of the month, perhaps to align with playlist updates or marketing cycles.

  5. Musical Attributes: Popular tracks tend to have slower tempos (lower BPM), are highly danceable and energetic, and exhibit low acousticness and instrumentalness. This suggests a preference for rhythm-driven, vocal-centric music over purely instrumental or acoustic songs.

  6. Valence: The valence variable, which measures musical positivity, shows a nearly symmetrical distribution around 50%. This indicates a balanced preference—listeners enjoy both upbeat and moodier tracks in similar measure.

  7. Liveness: Presence of live performance elements is common in the most streamed tracks. Users may enjoy the immersive feel of live-sounding tracks.
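Insights like these typically come from frequency tables and summary statistics for each feature. The sketch below shows one possible way to compute them with pandas on a made-up toy table; the column names "artist_count" and "valence_%" are assumed to match the Kaggle file, and the values are illustrative only.

```python
import pandas as pd

# Toy data (made up); not the real dataset.
toy = pd.DataFrame({
    "artist_count": [1, 1, 2, 1, 3, 1],
    "valence_%": [20, 45, 50, 55, 80, 50],
})

# Frequency table for a discrete feature, e.g. number of artists per track.
artist_count_freq = toy["artist_count"].value_counts().sort_index()

# Summary statistics (count, mean, std, quartiles) for a continuous feature.
valence_summary = toy["valence_%"].describe()

# Skewness near 0 suggests a roughly symmetrical distribution.
valence_skew = toy["valence_%"].skew()

print(artist_count_freq)
print(valence_summary)
print(valence_skew)
```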

Correlation analysis

Objective: Reveal linear relationships between features.

Comments

  1. The feature "streams" shows a strong positive correlation with "in_spotify_playlists" and "in_apple_playlists".

  2. A weaker, yet noticeable, correlation is also observed with "in_spotify_charts" and "in_apple_charts".

  3. These correlations are expected: tracks added to playlists are more likely to be streamed, and appearing in charts increases a track’s exposure to users.

  4. Additionally, some weaker but meaningful (anti)correlations with "streams" are found in features such as "released_year", "danceability_%", and "speechiness_%". These trends are also reasonable: streaming platform audiences tend to be younger and are often more engaged with recent music. Danceable tracks tend to be catchier and more frequently streamed, while highly instrumental or speech-heavy tracks (e.g. rap) may appeal to a narrower audience.
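Coefficients of this kind can be obtained with a Pearson correlation matrix. The following is a minimal sketch on made-up numbers, assuming pandas and its default Pearson method; it does not reproduce the values found in the real dataset.

```python
import pandas as pd

# Toy data (made up) with one positively and one negatively related feature.
toy = pd.DataFrame({
    "streams": [100, 200, 300, 400, 500],
    "in_spotify_playlists": [10, 22, 29, 41, 50],
    "released_year": [2023, 2022, 2021, 2019, 2015],
})

# Pearson correlation of every other feature with "streams".
corr_with_streams = toy.corr()["streams"].drop("streams")

print(corr_with_streams.sort_values(ascending=False))
```

Sorting the resulting column makes it easy to read off the strongest positive and negative linear relationships with "streams".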

Note: Correlation analysis is a valuable tool for identifying linear relationships between features. However, it's important to keep two key limitations in mind:

  1. Correlation may miss nonlinear relationships — features with low correlation coefficients might still be meaningfully related in a nonlinear way.

  2. Correlation does not imply causation — just because two variables move together does not mean that one causes the other. For example, there is a strong correlation between the number of ice creams sold and the number of people who go swimming. But this doesn't mean that buying ice cream causes people to swim, or vice versa. Instead, both are influenced by a third common factor: hot weather.
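The first limitation is easy to demonstrate with a small synthetic example (not taken from the dataset): below, y is fully determined by x, yet the Pearson coefficient is zero because the relationship is symmetric rather than linear.

```python
import pandas as pd

# Synthetic example: y depends on x exactly, but not linearly.
x = pd.Series([-3, -2, -1, 0, 1, 2, 3])
y = x ** 2

# Series.corr uses the Pearson coefficient by default.
pearson_r = x.corr(y)

print(pearson_r)
```

A coefficient near zero therefore rules out a linear trend only, which is why scatter plots remain useful alongside the correlation matrix.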

Multivariate analysis

Section objective: Explore dependencies between numerical features beyond linear relationships to reveal patterns in the dataset.

These plots reveal no clear patterns or dependencies.

Note

At first glance, the variable "streams" might seem like the most important metric for assessing a track's success. However, it's essential to consider the following:

  1. Release timing matters: Not all tracks were released at the same time. Songs released earlier naturally have more time to accumulate total streams.

  2. Artist fame skews visibility: Major artists (Taylor Swift, Ed Sheeran, etc.) benefit from (well-deserved) large fanbases and extensive media coverage. Thus, their exposure is largely a result of their fame, not necessarily the quality of a specific track.

Hence, we will also focus on two additional key metrics: "in_spotify_charts" and "in_spotify_playlists":

Comments

Top tracks analysis

The tracks in this dataset are all highly successful. So far, we have explored the characteristics of this group of 857 top tracks. In this section, we take it a step further and examine the most elite songs: what it takes not just to be successful, but to produce a true hit in today's extremely competitive music market.
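Selecting such an elite subset could be done along these lines. The sketch uses made-up data and a hypothetical cutoff (the 3 most streamed tracks); the report does not state which criterion or threshold was actually used.

```python
import pandas as pd

# Toy data (made up); track names are placeholders.
toy = pd.DataFrame({
    "track_name": list("ABCDEFGHIJ"),
    "streams": [50, 900, 120, 300, 75, 640, 210, 980, 430, 10],
})

# Hypothetical cutoff: keep the 3 tracks with the most streams.
top_tracks = toy.nlargest(3, "streams")

print(top_tracks)
```

The same call works with "in_spotify_charts" or "in_spotify_playlists" as the ranking column if one of those metrics is preferred.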

Comments

Top artists analysis

In this last section, we analyze the top 10 artists by total streams, with the aim of discovering common features in their music.
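A ranking of artists by total streams can be sketched as a group-and-sum followed by a top-n selection. The example below uses made-up data and keeps only the top 2 for brevity; the column name "artist(s)_name" is assumed to match the Kaggle file, and the whole string is treated as one artist, so collaborations are not split.

```python
import pandas as pd

# Toy data (made up); artist names are placeholders.
toy = pd.DataFrame({
    "artist(s)_name": ["Artist A", "Artist B", "Artist A",
                       "Artist C", "Artist B", "Artist A"],
    "streams": [500, 300, 200, 100, 50, 100],
})

# Total streams per artist, then the n artists with the highest totals
# (n = 2 here; the report uses the top 10).
streams_per_artist = toy.groupby("artist(s)_name")["streams"].sum()
top_artists = streams_per_artist.nlargest(2)

print(top_artists)
```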

Comments